First, let's look at how the movie rating site IMDb structures its pages, then extract the information we need and store it. The code is adapted from this article.
# Import the required packages
from pyquery import PyQuery as pq
import pandas as pd
def get_movie_info(movie_url):
    """
    Get movie info from a given IMDb title URL.
    """
    d = pq(movie_url)
    movie_rating = float(d("strong span").text())  # movie rating
    # The ".subtext a" links hold the genres, followed by the release date
    movie_genre = [x.text() for x in d(".subtext a").items()]  # movie genres
    movie_released_date = movie_genre.pop()  # last link is the release date
    movie_poster = d(".poster img").attr('src')  # poster image URL
    movie_cast = [x.text() for x in d(".primary_photo+ td a").items()]  # cast members
    # Return the movie info
    movie_info = {
        "Rating": movie_rating,
        "Released_Date": movie_released_date,
        "Genre": movie_genre,
        "Poster_Link": movie_poster,
        "Cast": movie_cast
    }
    return movie_info
# Fetch the info for one movie to see what it looks like
the_dressmaker = get_movie_info("https://www.imdb.com/title/tt2910904/")
print(the_dressmaker)
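The function above assumes every selector matches something on the page. If the rating text comes back empty, `float("")` raises `ValueError`, and if no `.subtext a` links are found, `pop()` on the empty list raises `IndexError`. The following is a minimal sketch of two guard helpers, not part of the original code. The names `parse_rating` and `split_genres_and_date` are hypothetical, and the sample strings are placeholders rather than real IMDb data.

```python
from typing import List, Optional, Tuple


def parse_rating(text: str) -> Optional[float]:
    """Convert rating text to a float; return None if it is empty or not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def split_genres_and_date(links: List[str]) -> Tuple[List[str], Optional[str]]:
    """Treat the last link text as the release date and the rest as genres."""
    if not links:
        # Nothing was scraped, so there are no genres and no date
        return [], None
    return links[:-1], links[-1]


# Placeholder inputs standing in for the scraped text
print(parse_rating("7.1"))
print(parse_rating(""))
print(split_genres_and_date(["Comedy", "Drama", "1 January 2000"]))
print(split_genres_and_date([]))
```

Inside `get_movie_info`, these could replace the direct `float(...)` and `pop()` calls, so that a page with missing fields yields `None` instead of stopping the scrape.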
# Convert the info into a DataFrame to have a look
df = pd.DataFrame.from_dict(the_dressmaker, orient='index')
df.transpose()
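As an alternative to building the frame with `orient='index'` and transposing it, the dictionary can be wrapped in a list so that each key becomes a column and each movie becomes one row. The sketch below uses a made-up `sample_info` dictionary with placeholder values, shaped like the return value of `get_movie_info`, so it runs without a network request.

```python
import pandas as pd

# Hypothetical placeholder data shaped like the dict returned by get_movie_info
sample_info = {
    "Rating": 7.0,
    "Released_Date": "1 January 2000",
    "Genre": ["Comedy", "Drama"],
    "Poster_Link": "https://example.com/poster.jpg",
    "Cast": ["Actor A", "Actor B", "Actor C"],
}

# A list of dicts gives one row per dict; list values are kept whole in a cell
df_one_row = pd.DataFrame([sample_info])
print(df_one_row.shape)
print(df_one_row.loc[0, "Genre"])
```

With this layout, scraping several movies only requires collecting the dictionaries in a list and passing that list to `pd.DataFrame`.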
The code for this post is available on GitHub.
Please let me know if you find any mistakes in this article. Thanks for reading.
References:
[1] 透過操控瀏覽器擷取網站資料 (Scraping website data by controlling the browser)
[2] What version of Chrome do I have?
[3] ChromeDriver - WebDriver for Chrome
[4] IMDb
[5] Stack Overflow